Abstract

We give a characterization of Maximum Entropy/Minimum Relative Entropy inference by providing two 'strong entropy concentration' theorems. These theorems unify and generalize Jaynes' 'concentration phenomenon' and Van Campenhout and Cover's 'conditional limit theorem'. The theorems characterize exactly in what sense a prior distribution $Q$ conditioned on a given constraint and the distribution $\tilde{P}$ minimizing $D(P\|Q)$ over all $P$ satisfying the constraint are 'close' to each other. We then apply our theorems to establish the relationship between entropy concentration and a game-theoretic characterization of Maximum Entropy Inference due to Topsøe and others.



1 Introduction 



Jaynes' Maximum Entropy (MaxEnt) Principle is a well-known principle for inductive inference [Csiszár, 1975, 1991, Topsøe, 1979, van Campenhout and Cover, 1981, Cover and Thomas, 1991, Grünwald and Dawid, 2004]. It has been applied to statistical and machine learning problems ranging from protein modeling to stock market prediction [Kapur and Kesavan, 1992]. One of its characterizations (some would say 'justifications') is the so-called concentration phenomenon [Jaynes, 1978, 1982]. Here is an informal version of this phenomenon, in the words of Jaynes:



"If the information incorporated into the maximum-entropy analysis 
includes all the constraints actually operating in the random exper- 
iment, then the distribution predicted by maximum entropy is over- 
whelmingly the most likely to be observed experimentally." 

For the case in which a prior distribution over the domain at hand is available, van Campenhout and Cover [1981] have proven the related conditional limit theorem. In Sections 2-4 we provide a strong generalization of both the concentration phenomenon and the conditional limit theorem. In Section 5 the results of Section 4 are used to extend an existing game-theoretic characterization (again, some would say "justification") of Maximum Entropy due to Topsøe [1979]. In this way, we provide sharper results on two of the most frequently cited characterizations of the maximum entropy principle.



2 Informal Overview 

Maximum Entropy Let $X$ be a random variable taking values in some set $\mathcal{X}$, which (only for the time being!) we assume to be finite: $\mathcal{X} = \{1, \ldots, m\}$. Let $P, Q$ be distributions for $X$ with probability mass functions $p$ and $q$. We define $H_Q(P)$, the $Q$-entropy of $P$, as

$$H_Q(P) = -E_P\left[\log \frac{p(X)}{q(X)}\right] = -D(P\|Q), \tag{1}$$

where $D(\cdot\|\cdot)$ is the Kullback-Leibler (KL) divergence between $P$ and $Q$ [Cover and Thomas, 1991]. In the usual MaxEnt setting, we are given a 'prior' distribution $Q$ and a moment constraint:

$$E[T(X)] = \tilde{t} \tag{2}$$



where $T$ is some function $T: \mathcal{X} \to \mathbb{R}^k$ for some $k > 0$. (More general formulations with arbitrary convex constraints exist [Csiszár, 1975], but here we stick to constraints of form (2).) We define, if it exists, $\tilde{P}$ to be the unique distribution over $\mathcal{X}$ that maximizes the $Q$-entropy over all distributions (over $\mathcal{X}$) satisfying (2):

$$\tilde{P} = \arg\max_{\{P : E_P[T(X)] = \tilde{t}\}} H_Q(P) = \arg\min_{\{P : E_P[T(X)] = \tilde{t}\}} D(P\|Q) \tag{3}$$

The MaxEnt Principle then tells us that, in absence of any further knowledge about the 'true' or 'posterior' distribution according to which data are distributed, our best guess for it is $\tilde{P}$. In practical problems we are usually not given a constraint of form (2). Rather we are given an empirical constraint of the form

$$\frac{1}{n}\sum_{i=1}^n T(X_i) = \tilde{t}, \quad \text{which we always abbreviate to '}\overline{T}^{(n)} = \tilde{t}\text{'} \tag{4}$$

The MaxEnt Principle is then usually applied as follows: suppose we are given an empirical constraint of form (4). We then have to make predictions about new data coming from the same source. In absence of knowledge of any 'true' distribution generating this data, we should make our predictions based on the MaxEnt distribution $\tilde{P}$ for the moment constraint (2) corresponding to empirical constraint (4). $\tilde{P}$ is extended to several outcomes by taking the product distribution.



The Concentration Phenomenon and The Conditional Limit Theorem 

Why should this procedure make any sense? Here is one justification. If $\mathcal{X}$ is finite, and in the absence of any prior knowledge beside the constraint, one usually picks the uniform distribution for $Q$. In this case, Jaynes' 'concentration phenomenon' applies. (We are referring here to the version employed by Jaynes [1978]. The theorem of Jaynes [1982] extends this in a direction different from the one we consider here.) It says that for all $\epsilon > 0$,

$$Q^n\left( \sup_{j \in \mathcal{X}} \left| \frac{1}{n}\sum_{i=1}^n I_j(X_i) - \tilde{P}(X = j) \right| > \epsilon \;\middle|\; \overline{T}^{(n)} = \tilde{t} \right) = O(e^{-cn}) \tag{5}$$

for some constant $c$ depending on $\epsilon$. Here $Q^n$ is the $n$-fold product distribution of $Q$, and $I$ is the indicator function: $I_j(x) = 1$ if $x = j$ and $0$ otherwise. In words, for the overwhelming majority among the sequences satisfying the constraint, the empirical frequencies are close to the maximum entropy probabilities. It turns out that (5) still holds if $Q$ is non-uniform. For an illustration we refer to Example 4.2. A closely related result (Theorem 1 of van Campenhout and Cover [1981]) is the conditional limit theorem. (This theorem too has later been extended in several directions different from the one considered here; see the discussion at the end of Section 4.) It says that

$$\lim_{n \to \infty,\; n\tilde{t} \in \mathbb{N}} Q^1(\cdot \mid \overline{T}^{(n)} = \tilde{t}) = \tilde{P}^1(\cdot) \tag{6}$$

where $Q^1(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ and $\tilde{P}^1(\cdot)$ refer to the marginal distribution of $X_1$ under $Q^n(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ and $\tilde{P}$ respectively.
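The conditional limit theorem can be checked exactly on a small case. The sketch below is an illustration rather than part of the original argument: it assumes a six-sided die with uniform prior, the constraint that the average number of spots is 4.5, and the MaxEnt probabilities quoted from Jaynes in Example 4.2. The function names are hypothetical. Under a uniform prior, the conditional law of the first throw given the sum is a ratio of sequence counts, which a simple dynamic program computes with exact integers.

```python
# MaxEnt probabilities for the dice constraint E[X] = 4.5 (values quoted in Example 4.2).
P_TILDE = [0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749]


def count_sequences(length, faces=6):
    """counts[s] = number of sequences of `length` throws whose faces sum to s."""
    counts = [1]  # the empty sequence has sum 0
    for _ in range(length):
        nxt = [0] * (len(counts) + faces)
        for s, c in enumerate(counts):
            if c:
                for face in range(1, faces + 1):
                    nxt[s + face] += c
        counts = nxt
    return counts


def conditional_marginal(n, total, faces=6):
    """Exact law of X_1 under uniform Q^n conditioned on X_1 + ... + X_n = total."""
    rest = count_sequences(n - 1, faces)
    weights = []
    for face in range(1, faces + 1):
        s = total - face  # what the remaining n - 1 throws must add up to
        weights.append(rest[s] if 0 <= s < len(rest) else 0)
    norm = sum(weights)
    return [w / norm for w in weights]


if __name__ == "__main__":
    for n in (10, 50, 200):  # n must be even so that 4.5 * n is an integer
        marginal = conditional_marginal(n, total=9 * n // 2)
        gap = max(abs(a - b) for a, b in zip(marginal, P_TILDE))
        print(n, [round(v, 4) for v in marginal], "max gap:", round(gap, 5))
```

The printed gap between the conditioned prior's marginal and the MaxEnt probabilities shrinks as $n$ grows, which is the content of (6) for this example.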
Our Results Both theorems above say that for some sets $\mathcal{A}$,

$$Q^n(\mathcal{A} \mid \overline{T}^{(n)} = \tilde{t}) \approx \tilde{P}^n(\mathcal{A}) \tag{7}$$

In the concentration phenomenon, the set $\mathcal{A} \subset \mathcal{X}^n$ is about the frequencies of individual outcomes in the sample. In the conditional limit theorem $\mathcal{A} \subset \mathcal{X}^1$ only concerns the first outcome. One might conjecture that (7) holds asymptotically in a much wider sense, namely for just about any set whose probability one may be interested in. For examples of such sets see Example 4.2. In Theorems 4.1 and 4.3 we show that (7) indeed holds for a very large class of sets; moreover, we give an explicit indication of the error one makes if one approximates $Q(\mathcal{A} \mid \overline{T}^{(n)} = \tilde{t})$ by $\tilde{P}(\mathcal{A})$. In this way we unify and strengthen both the concentration phenomenon and the conditional limit theorem. To be more precise, let $\{\mathcal{A}_n\}$, with $\mathcal{A}_i \subset \mathcal{X}^i$, be a sequence of 'typical' sets for $\tilde{P}$ in the sense that $\tilde{P}^n(\mathcal{A}_n)$ goes to 1 sufficiently fast. Then broadly speaking Theorem 4.1 shows that $Q^n(\mathcal{A}_n \mid \overline{T}^{(n)} = \tilde{t})$ goes to 1 too, 'almost' as fast as $\tilde{P}^n(\mathcal{A}_n)$. Theorem 4.3, our main theorem, says that, if $m_n$ is an arbitrary increasing sequence with $\lim_{n\to\infty} m_n/n = 0$, then for every (measurable) sequence $\mathcal{A}_{m_1}, \mathcal{A}_{m_2}, \ldots$ (i.e. not just the typical ones), with $\mathcal{A}_{m_n} \subset \mathcal{X}^{m_n}$, $\tilde{P}^{m_n}(\mathcal{A}_{m_n}) \to Q^n(\mathcal{A}_{m_n} \mid \overline{T}^{(n)} = \tilde{t})$. In Section 5 we first give an interpretation of our strong concentration results in terms of data compression. We then show (Theorem 5.2) that our concentration phenomenon implies that the MaxEnt distribution $\tilde{P}$ achieves the best minimax time-averaged logarithmic loss (codelength) achievable for sequential prediction of samples satisfying the constraint. We also characterize (Theorems 5.3 and 5.4) the precise conditions under which $\tilde{P}$ also achieves the total (non-time averaged) minimax logarithmic loss. Surprisingly, the answer depends crucially on the dimensionality $k$ of the constraint random vector $T$: for $k \le 2$, $\tilde{P}$ is also best in the total sense. For $k \ge 3$, there exist distributions which consistently outperform $\tilde{P}$. This is related to the well-known fact that random walks in $\mathbb{R}^k$ are transient if $k \ge 3$.

3 Mathematical Preliminaries 

The Sample Space From now on we assume a sample space $\mathcal{X} \subseteq \mathbb{R}^l$ for some $l > 0$ and let $X$ be the random vector with $X(x) = x$ for all $x \in \mathcal{X}$. We reserve the symbol $Q$ to refer to a distribution for $X$ called the prior distribution (formally, $Q$ is a distribution on $(\mathcal{X}, \sigma(\mathcal{X}))$ where $\sigma(\mathcal{X})$ is the Borel-$\sigma$-algebra generated by $\mathcal{X}$). We will be interested in sequences of i.i.d. random variables $X_1, X_2, \ldots$, all distributed according to $Q$. Whenever no confusion can arise, we use $Q$ also to refer to the joint (product) distribution of $\times_{i \in \mathbb{N}} X_i$. Otherwise, we use $Q^m$ to denote the $m$-fold product distribution of $Q$. The sample $(X_1, \ldots, X_m)$ will also be written as $X^{(m)}$.



The Constraint Functions T Let $T = (T_{[1]}, \ldots, T_{[k]})$ be a $k$-dimensional random vector that is $\sigma(\mathcal{X})$-measurable. We refer to the event $\{x \in \mathcal{X} \mid T(x) = t\}$ both as '$T(X) = t$' and as '$T = t$'. Similarly we write $T_i = t$ as an abbreviation of $T(X_i) = t$ and $T^{(n)}$ as short for $(T(X_1), \ldots, T(X_n))$. The average of $n$ observations of $T$ will be denoted by $\overline{T}^{(n)} := n^{-1}\sum_{i=1}^n T(X_i)$. We assume that the support of $X$ is either countable (in which case the prior distribution $Q$ admits a probability mass function) or that it is a connected subset of $\mathbb{R}^l$ for some $l \ge 1$ (in which case we assume that $Q$ has a bounded continuous density with respect to Lebesgue measure). In both cases, we denote the probability mass function/density by $q$. If $\mathcal{X}$ is countable, we shall further assume that $T$ is of the lattice form (which it will be in most applications):



Definition 3.1 [Feller, 1968, page 490] A $k$-dimensional lattice random vector $T = (T_{[1]}, \ldots, T_{[k]})$ is a random vector for which there exist real-valued $b_1, \ldots, b_k$ and $h_1, \ldots, h_k$ such that, for $1 \le j \le k$, $\forall x \in \mathcal{X}: T_{[j]}(x) \in \{b_j + s h_j \mid s \in \mathbb{N}\}$. We call the largest $h_j$ for which this holds the span of $T_{[j]}$.
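For a single component $T_{[j]}$ taking finitely many rational values, the span in Definition 3.1 is the greatest common divisor of the differences between the values and the smallest value. The sketch below makes this concrete; the function name `lattice_span` and the restriction to finite sets of rationals are assumptions made for illustration only.

```python
from fractions import Fraction
from math import gcd


def _gcd_fraction(a, b):
    """Greatest common divisor of two non-negative rationals."""
    return Fraction(
        gcd(a.numerator * b.denominator, b.numerator * a.denominator),
        a.denominator * b.denominator,
    )


def lattice_span(values):
    """Largest h such that every value equals b + s*h for a common b and integers s >= 0.

    `values` is a finite, non-empty collection of ints or Fractions. Floats are
    rejected because binary rounding would make the gcd meaningless. Returns None
    when all values coincide, since then every h works and no largest one exists.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(isinstance(v, float) for v in values):
        raise TypeError("use int or Fraction values, not float")
    vals = sorted({Fraction(v) for v in values})
    span = Fraction(0)
    for v in vals[1:]:
        span = _gcd_fraction(span, v - vals[0])  # b is the smallest value
    return span if span > 0 else None


if __name__ == "__main__":
    print(lattice_span(range(1, 7)))                # faces of a die
    print(lattice_span([0, 1]))                     # an indicator function
    print(lattice_span([0, Fraction(2, 5), 1]))     # values 0, 0.4, 1
```

For the dice outcomes and for indicator functions the span is 1, which is why the factor $\prod_j h_j$ drops out of the bounds in those examples.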

If $\mathcal{X}$ is continuous, we shall assume that $T$ is 'regular':

Definition 3.2 We say a $k$-dimensional random vector is of regular continuous form if its distribution admits a bounded continuous density with respect to Lebesgue measure.

Maximum Entropy Throughout the paper, log is used to denote logarithm to base 2. Let $P, Q$ be distributions for $X$. We define $H_Q(P)$, the $Q$-entropy of $P$, as

$$H_Q(P) = -D(P\|Q), \tag{8}$$

where $D$ is the KL-divergence between $P$ and $Q$. This is defined even if $P$ or $Q$ have no densities [Csiszár, 1975]. Assume we are given a constraint of form (2), i.e. $E_P[T(X)] = \tilde{t}$. Here $T = (T_{[1]}, \ldots, T_{[k]})$, $\tilde{t} = (\tilde{t}_{[1]}, \ldots, \tilde{t}_{[k]})$. We define, if it exists, $\tilde{P}$ to be the unique distribution on $\mathcal{X}$ that maximizes the $Q$-entropy over all distributions on $\mathcal{X}$ satisfying (2). That is, $\tilde{P}$ is given by (3). If Condition 1 below holds, then $\tilde{P}$ exists and is given by the exponential form (9), as expressed in Proposition 3.3 below. In the condition, the notation $a^T b$ refers to the dot product between $a$ and $b$.

Condition 1: There exists $\tilde{\beta} \in \mathbb{R}^k$ such that $Z(\tilde{\beta}) = \int_{x \in \mathcal{X}} \exp(-\tilde{\beta}^T T(x))\, dQ(x)$ is finite and the distribution $\tilde{P}$ with density (with respect to $Q$)

$$\tilde{p}(x) := \frac{1}{Z(\tilde{\beta})}\, e^{-\tilde{\beta}^T T(x)} \tag{9}$$

satisfies $E_{\tilde{P}}[T(X)] = \tilde{t}$.

In our theorems, we shall simply assume that Condition 1 holds. A sufficient (by no means necessary!) requirement for Condition 1 is for example that $Q$ has bounded support; Csiszár [1975] gives a more precise characterization. We will also assume in our theorems the following natural condition:

Condition 2: The '$T$-covariance matrix' $\Sigma$ with $\Sigma_{ij} = E_{\tilde{P}}[T_{[i]} T_{[j]}] - E_{\tilde{P}}[T_{[i]}]\, E_{\tilde{P}}[T_{[j]}]$ is invertible.

$\Sigma$ is guaranteed to exist by Condition 1 (see any book with a treatment of exponential families, for example, [Grünwald, 2007]) and will be singular only if either $\tilde{t}_{[j]}$ lies at the boundary of the range of $T_{[j]}$ for some $j$ or if some of the $T_{[j]}$ are affine combinations of the others. In the first case, the constraint $T_{[j]} = \tilde{t}_{[j]}$ can be replaced by restricting the sample space to $\{x \in \mathcal{X} \mid T_{[j]}(x) = \tilde{t}_{[j]}\}$ and considering the remaining constraints for the new sample space. In the second case, we can remove some of the $T_{[i]}$ from the constraint without changing the set of distributions satisfying it, making $\Sigma$ once again invertible.



Proposition 3.3 ([Csiszár, 1975]) Assume Condition 1 holds for Constraint (2). Then $\inf \{D(P\|Q) \mid P : E_P[T(X)] = \tilde{t}\}$ is attained by a $\tilde{P}$ of the form (9). If additionally, Condition 2 holds, then Condition 1 holds for only one $\tilde{\beta} \in \mathbb{R}^k$ and the infimum is uniquely attained by the unique $\tilde{P}$ satisfying (9).

If Condition 1 holds, then $\tilde{t}$ determines both $\tilde{\beta}$ and $\tilde{P}$.
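To make the exponential form (9) concrete, the following sketch finds $\tilde{\beta}$ numerically in the simplest setting. It assumes a finite sample space, a strictly positive prior mass function, and a scalar constraint ($k = 1$); the function names and the bisection bracket are illustrative choices, not something prescribed by the paper. Bisection works here because the mean of $T$ under the tilted distribution is strictly decreasing in $\beta$ (its derivative is minus the variance of $T$).

```python
import math


def tilted(q, t_values, beta):
    """Distribution with mass proportional to q(x) * exp(-beta * T(x)), as in form (9)."""
    exponents = [-beta * tx for tx in t_values]
    shift = max(exponents)  # subtract the maximum to avoid overflow in exp
    weights = [qx * math.exp(e - shift) for qx, e in zip(q, exponents)]
    z = sum(weights)
    return [w / z for w in weights]


def mean(p, t_values):
    return sum(px * tx for px, tx in zip(p, t_values))


def solve_beta(q, t_values, target, lo=-50.0, hi=50.0, iters=200):
    """Bisection for the beta whose tilted distribution has E[T] = target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(tilted(q, t_values, mid), t_values) > target:
            lo = mid  # mean still too large, so beta must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


def kl_bits(p, q):
    """D(p || q) in bits."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)


if __name__ == "__main__":
    q = [1 / 6] * 6            # uniform prior on the faces of a die
    faces = [1, 2, 3, 4, 5, 6]
    beta = solve_beta(q, faces, target=4.5)
    p_tilde = tilted(q, faces, beta)
    print("beta:", round(beta, 5))
    print("p~  :", [round(v, 5) for v in p_tilde])
    print("D(p~||q) in bits:", round(kl_bits(p_tilde, q), 5))
```

With the dice constraint $E[X] = 4.5$ this reproduces the probabilities listed in (13), and any other distribution with the same mean, such as the one putting mass 1/2 on faces 4 and 5, has a larger divergence from the uniform prior.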



4 The Concentration Theorems 

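Theorem 4.1 (the concentration phenomenon for typical sets, lattice case) Assume we are given a constraint of form (2) such that $T$ is of the lattice form and $h = (h_1, \ldots, h_k)$ is the span of $T$ and such that Conditions 1 and 2 hold. Then there exists a sequence $\{c_i\}$ satisfying

$$\lim_{n \to \infty} c_n = \frac{\prod_{j=1}^k h_j}{\sqrt{(2\pi)^k \det \Sigma}}$$

such that

1. Let $\mathcal{A}_1, \mathcal{A}_2, \ldots$ be an arbitrary sequence of sets with $\mathcal{A}_i \subset \mathcal{X}^i$. For all $n$ with $Q(\overline{T}^{(n)} = \tilde{t}) > 0$, we have:

$$\tilde{P}(\mathcal{A}_n) \ge n^{-k/2}\, c_n\, Q(\mathcal{A}_n \mid \overline{T}^{(n)} = \tilde{t}). \tag{10}$$

Hence if $\mathcal{B}_1, \mathcal{B}_2, \ldots$ is a sequence of sets with $\mathcal{B}_i \subset \mathcal{X}^i$ whose probability tends to 1 under $\tilde{P}$ in the sense that $1 - \tilde{P}(\mathcal{B}_n) = O(f(n)\, n^{-k/2})$ for some function $f: \mathbb{N} \to \mathbb{R}$, $f(n) = o(1)$, then $Q(\mathcal{B}_n \mid \overline{T}^{(n)} = \tilde{t})$ tends to 1 in the sense that $1 - Q(\mathcal{B}_n \mid \overline{T}^{(n)} = \tilde{t}) = O(f(n))$.

2. If for all $n$, $\mathcal{A}_n \subset \{x^{(n)} \mid n^{-1}\sum_{i=1}^n T(x_i) = \tilde{t}\}$, then (10) holds with equality.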

As discussed in Section 5, Theorem 4.1 has applications for data compression. The relation of the Theorem to Jaynes' original concentration phenomenon is discussed at the end of the present section.

Proof We need the following theorem:

Theorem ('local central limit theorem for lattice random variables', [Feller, 1968, page 490]) Let $T = (T_{[1]}, \ldots, T_{[k]})$ be a lattice random vector and $h_1, \ldots, h_k$ be the corresponding spans as in Definition 3.1; let $E_P[T(X)] = t$ and suppose that $P$ satisfies Condition 2 with $T$-covariance matrix $\Sigma$. Let $X_1, X_2, \ldots$ be i.i.d. with common distribution $P$. Let $V$ be a closed and bounded set in $\mathbb{R}^k$. Let $v_1, v_2, \ldots$ be a sequence in $V$ such that for all $n$, $P(\sum_{i=1}^n (T_i - t)/\sqrt{n} = v_n) > 0$. Then as $n \to \infty$,

$$\frac{n^{k/2}}{\prod_{j=1}^k h_j}\, P\left( \frac{\sum_{i=1}^n (T_i - t)}{\sqrt{n}} = v_n \right) - \aleph(v_n) \to 0.$$

Here $\aleph$ is the density of a $k$-dimensional normal distribution with mean vector $0$ and covariance matrix $\Sigma$, the limiting distribution of the centered and normalized sum.

Feller gives the local central limit theorem only for 1-dimensional lattice random variables with $E[T] = 0$ and $\mathrm{var}[T] = 1$; extending the proof to $k$-dimensional random vectors with arbitrary means and covariances is, however, completely straightforward: see XV.7 (page 494) of [Feller, 1968].

The theorem shows that there exists a sequence $d_1, d_2, \ldots$ with $\lim_{n\to\infty} d_n = 1$ such that, for all $n$ with $P(\sum_{i=1}^n (T_i - t) = 0) > 0$,

$$\frac{\sqrt{(2\pi n)^k \det \Sigma}}{\prod_{j=1}^k h_j}\, P\left( \frac{\sum_{i=1}^n (T_i - t)}{\sqrt{n}} = 0 \right) = \frac{\sqrt{(2\pi n)^k \det \Sigma}}{\prod_{j=1}^k h_j}\, P\left( \frac{1}{n}\sum_{i=1}^n T_i = t \right) = d_n \tag{11}$$

The proof now becomes very simple. First note that $\tilde{P}(\mathcal{A}_n \mid \overline{T}^{(n)} = \tilde{t}) = Q(\mathcal{A}_n \mid \overline{T}^{(n)} = \tilde{t})$ (write out the definition of conditional probability and realize that $\exp(-\tilde{\beta}^T T(x)) = \exp(-\tilde{\beta}^T \tilde{t}) = \text{constant}$ for all $x$ with $T(x) = \tilde{t}$). Use this to show that

$$\tilde{P}(\mathcal{A}_n) \ge \tilde{P}(\mathcal{A}_n, \overline{T}^{(n)} = \tilde{t}) = \tilde{P}(\mathcal{A}_n \mid \overline{T}^{(n)} = \tilde{t})\, \tilde{P}(\overline{T}^{(n)} = \tilde{t}) = Q(\mathcal{A}_n \mid \overline{T}^{(n)} = \tilde{t})\, \tilde{P}(\overline{T}^{(n)} = \tilde{t}). \tag{12}$$

Clearly, with $\tilde{P}$ in the role of $P$, the local central limit theorem is applicable to random vector $T$. Then, by (11), $\tilde{P}(\overline{T}^{(n)} = \tilde{t}) = d_n \left(\prod_{j=1}^k h_j\right) / \sqrt{(2\pi n)^k \det \Sigma}$. Defining $c_n := \tilde{P}(\overline{T}^{(n)} = \tilde{t})\, n^{k/2}$ finishes the proof of item 1. For item 2, notice that in this case (12) holds with equality; the rest of the proof remains unchanged.

□ 



Example 4.2 The 'Brandeis dice example' is a toy example frequently used by Jaynes and others in discussions of the MaxEnt formalism [Jaynes, 1978]. Let $\mathcal{X} = \{1, \ldots, 6\}$ and $X$ be the outcome in one throw of some given die. We initially believe (e.g. for reasons of symmetry) that the distribution of $X$ is uniform. Then $Q(X = j) = 1/6$ for all $j$ and $E_Q[X] = 3.5$. We are then told that the average number of spots is $E[X] = 4.5$ rather than 3.5. As calculated by Jaynes, the MaxEnt distribution $\tilde{P}$ given this constraint is given by

$$(\tilde{p}(1), \ldots, \tilde{p}(6)) = (0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749). \tag{13}$$

By the Chernoff/Hoeffding bound, for every $j \in \mathcal{X}$, every $\epsilon > 0$, $\tilde{P}(|n^{-1}\sum_{i=1}^n I_j(X_i) - \tilde{p}(j)| > \epsilon) < 2\exp(-nc)$ for some constant $c > 0$ depending on $\epsilon$; here $I_j(X)$ is the indicator function for $X = j$. Theorem 4.1 then implies that $Q(|n^{-1}\sum_{i=1}^n I_j(X_i) - \tilde{p}(j)| > \epsilon \mid \overline{T}^{(n)} = \tilde{t}) = O(\sqrt{n}\, e^{-nc}) = O(e^{-nc'})$ for some $c' > 0$. In this way we recover Jaynes' original concentration phenomenon (5): the fraction of sequences satisfying the constraint with frequencies close to MaxEnt probabilities $\tilde{p}$ is overwhelmingly large. Suppose now we receive new information about an additional constraint: $P(X = 4) = P(X = 5) = 1/2$. This can be expressed as a moment constraint by $E[(I_4(X), I_5(X))^T] = (0.5, 0.5)^T$, where $I_j$ is the indicator function of the event $X = j$. We can now either use $\tilde{P}$ defined as in (13) in the role of prior $Q$ and impose the new constraint $E[(I_4(X), I_5(X))^T] = (0.5, 0.5)^T$, or use uniform $Q$ and impose the combined constraint $E[T] = E[(T_{[1]}, T_{[2]}, T_{[3]})^T] = (4.5, 0.5, 0.5)^T$, with $T_{[1]} = X$, $T_{[2]} = I_4(X)$, $T_{[3]} = I_5(X)$. In both cases we end up with a new MaxEnt distribution $\tilde{p}(4) = \tilde{p}(5) = 1/2$. This distribution, while still consistent with the original constraint $E[X] = 4.5$, rules out the vast majority of sequences satisfying it. However, we can apply our concentration phenomenon
again to the new MaxEnt distribution $\tilde{P}$. Let $\mathcal{I}_{j,j',\epsilon}$ denote the event that

$$\left| \frac{1}{n-1}\sum_{i=1}^{n-1} I_j(X_i)\, I_{j'}(X_{i+1}) - \tilde{p}(j)\, \tilde{p}(j') \right| > \epsilon.$$

According to $\tilde{P}$, we still have that $X_1, X_2, \ldots$ are i.i.d. Then by the Chernoff/Hoeffding bound, for each $\epsilon > 0$, for $j, j' \in \{4, 5\}$, $\tilde{P}(\mathcal{I}_{j,j',\epsilon})$ is exponentially small. Theorem 4.1 then implies that $Q^n(\mathcal{I}_{j,j',\epsilon} \mid \overline{T}^{(n)} = (4.5, 0.5, 0.5)^T)$ is exponentially small too: for the overwhelming majority of samples satisfying the combined constraint, the sample will look just as if it had been generated by an i.i.d. process, even though $X_1, \ldots, X_n$ are obviously not completely independent under $Q^n(\cdot \mid \overline{T}^{(n)} = (4.5, 0.5, 0.5)^T)$.
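The first half of Example 4.2 (the single constraint $E[X] = 4.5$ with a uniform prior) can be verified exactly for small $n$. The sketch below is illustrative only and its function names are hypothetical. It enumerates all vectors of face counts compatible with the empirical constraint, weights each by the number of sequences that realize it, and returns the fraction of constraint-satisfying sequences in which some empirical frequency is more than $\epsilon$ away from the probabilities in (13). Under a uniform prior this fraction is exactly the conditional probability in (5).

```python
from math import factorial

# MaxEnt probabilities from (13).
P_TILDE = [0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749]


def deviation_probability(n, eps):
    """Exact Q^n(max_j |freq_j - p~(j)| > eps | sum of faces = 4.5 n), uniform Q."""
    if n % 2:
        raise ValueError("4.5 * n must be an integer, so n has to be even")
    total = 9 * n // 2
    fact = [factorial(i) for i in range(n + 1)]
    good = bad = 0
    for n6 in range(n + 1):
        for n5 in range(n + 1 - n6):
            for n4 in range(n + 1 - n6 - n5):
                for n3 in range(n + 1 - n6 - n5 - n4):
                    rest = n - n3 - n4 - n5 - n6                        # n1 + n2
                    points = total - (3 * n3 + 4 * n4 + 5 * n5 + 6 * n6)  # n1 + 2*n2
                    n2 = points - rest
                    n1 = rest - n2
                    if n1 < 0 or n2 < 0:
                        continue
                    counts = (n1, n2, n3, n4, n5, n6)
                    ways = fact[n]  # multinomial coefficient: sequences with these counts
                    for c in counts:
                        ways //= fact[c]
                    if max(abs(c / n - p) for c, p in zip(counts, P_TILDE)) > eps:
                        bad += ways
                    else:
                        good += ways
    return bad / (good + bad)


if __name__ == "__main__":
    for n in (12, 24, 48):
        print(n, deviation_probability(n, eps=0.1))
```

The exhaustive enumeration is only feasible for modest $n$, but it already shows the conditional probability of a large deviation falling as $n$ increases.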

There also exists a version of Theorem 4.1 for continuous-valued random vectors. This is given, along with the proof, in technical report [Grünwald, 2001b].

There are a few limitations to Theorem 4.1: (1) we must require that $\tilde{P}(\mathcal{A}_n)$ goes to 0 or 1 as $n \to \infty$; (2) the continuous case needed a separate statement, which is caused by the more fundamental (3) it turns out that the proof technique used cannot be adapted to point-wise conditioning on $\overline{T}^{(n)} = \tilde{t}$ in the continuous case [Grünwald, 2001b]. Theorem 4.3 overcomes all these problems. The price we pay is that, when conditioning on $\overline{T}^{(n)} = \tilde{t}$, the sets $\mathcal{A}_m$ must only refer to $X_1, \ldots, X_m$ where $m$ is such that $m/n \to 0$; for example, $m = \lceil n/\log n \rceil$ will work. Whenever in the case of continuous-valued $T$ we write $Q(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ or $\tilde{P}(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ we refer to the continuous version of these quantities. These are easily shown to exist [Grünwald, 2001b]. Recall that (for $m < n$) $Q^m(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ refers to the marginal distribution of $X_1, \ldots, X_m$ conditioned on $\overline{T}^{(n)} = \tilde{t}$. It is implicitly understood in the theorem that in the lattice case, $n$ ranges only over those values for which $Q(\overline{T}^{(n)} = \tilde{t}) > 0$.

Theorem 4.3 (Main Theorem: the Strong Concentration Phenomenon/Strong Conditional Limit Theorem) Let $\{m_i\}$ be an increasing sequence with $m_i \in \mathbb{N}$, such that $\lim_{n\to\infty} m_n/n = 0$. Assume we are given a constraint of form (2) such that $T$ is of the regular continuous form or of the lattice form and suppose that Conditions 1 and 2 are satisfied. Then as $n \to \infty$, $Q^{m_n}(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ converges weakly to $\tilde{P}^{m_n}(\cdot)$.

Discussion of "weak convergence" as well as the proof (using the same key idea, but involving much more work than the proof of Theorem 4.1) is in technical report [Grünwald, 2001b].

Related Results Theorem 4.1 is related to Jaynes' original concentration phenomenon, the proof of which is based on Stirling's approximation of the factorial. Another closely related result (also based on Stirling's approximation) is in Example 5.5.8 of Li and Vitányi [1997]. Both results can be easily extended to prove the following weaker version of Theorem 4.1, item 1: $\tilde{P}(\mathcal{A}_n) \ge n^{-|\mathcal{X}|}\, c_n\, Q(\mathcal{A}_n \mid \overline{T}^{(n)} = \tilde{t})$, where $c_n$ tends to some constant. Note that in this form, the theorem is void for infinite sample spaces. Jaynes [1982] extends the original concentration phenomenon in a direction somewhat different from Theorem 4.1; it would be interesting to study the relations.

Theorem 4.3 is similar to the original 'conditional limit theorems' (Theorems 1 and 2) of van Campenhout and Cover [1981]. We note that the preconditions for our theorem to hold are weaker and the conclusion is stronger than for the original conditional limit theorems, the main novelty being that Theorem 4.3 supplies us with an explicit bound on how fast $m$ can grow as $n$ tends to infinity. The conditional limit theorem was later extended by Csiszár [1984]. His setting is considerably more general than ours (e.g. allowing for general convex constraints rather than just moment constraints), but his results also lack an explicit estimate of the rate at which $m$ can increase with $n$. Csiszár [1984] and Cover and Thomas [1991] (where a simplified version of the conditional limit theorem is proved) both make the connection to large deviation results, in particular Sanov's theorem. As shown in the latter reference, weak versions of the conditional limit theorem can be interpreted as immediate consequences of Sanov's theorem.



5 Consequences for Data Compression Games 

For simplicity we restrict ourselves in this section to countable sample spaces X 
and we identify probability mass functions with probability distributions. Below 
we make frequent use of coding-theoretic concepts which we first briefly review. 



5.1 Theorem 4.1 and Data Compression

Recall that by the Kraft Inequality [Cover and Thomas, 1991], for every prefix code with lengths $L$ over symbols from a countable alphabet $\mathcal{X}^n$, there exists a (possibly sub-additive) probability mass function $p$ over $\mathcal{X}^n$ such that for all $x^{(n)} \in \mathcal{X}^n$, $L(x^{(n)}) = -\log p(x^{(n)})$. We will call this $p$ the 'probability (mass) function corresponding to $L$'. Similarly, for every probability mass function $p$ over $\mathcal{X}^n$ there exists a (prefix) code with lengths $L(x^{(n)}) = \lceil -\log p(x^{(n)}) \rceil$. Neglecting the round-off error, we will simply say that for every $p$, there exists a code with lengths $L(x^{(n)}) = -\log p(x^{(n)})$. We call the code with these lengths 'the code corresponding to $p$'. By the information inequality [Cover and Thomas, 1991] this is also the most efficient code to use if data $X^{(n)}$ were actually distributed according to $p$.
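The correspondence between probability mass functions and codelengths can be illustrated with a few lines of code. The sketch below (function names are hypothetical) computes the rounded-up lengths $\lceil -\log p(x) \rceil$ for a strictly positive mass function, checks the Kraft sum $\sum_x 2^{-L(x)} \le 1$ that guarantees a prefix code with these lengths exists, and compares the expected length with the entropy.

```python
import math


def shannon_code_lengths(p):
    """Integer codeword lengths L(x) = ceil(-log2 p(x)); requires p(x) > 0 for all x."""
    if any(px <= 0 for px in p):
        raise ValueError("all probabilities must be strictly positive")
    return [math.ceil(-math.log2(px)) for px in p]


def kraft_sum(lengths):
    """Sum of 2^(-L(x)); a prefix code with these lengths exists iff this is at most 1."""
    return sum(2.0 ** -length for length in lengths)


def entropy_bits(p):
    return -sum(px * math.log2(px) for px in p if px > 0)


def expected_length(p, lengths):
    return sum(px * length for px, length in zip(p, lengths))


if __name__ == "__main__":
    for p in ([0.5, 0.25, 0.125, 0.125],
              [0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749]):
        lengths = shannon_code_lengths(p)
        print(lengths, kraft_sum(lengths),
              round(entropy_bits(p), 4), round(expected_length(p, lengths), 4))
```

For a dyadic distribution the rounding is vacuous and the expected length equals the entropy; in general the rounding costs less than one bit, which is the round-off error neglected in the text.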

We can now see that Theorem 4.1, item 2, has important implications for coding. Consider the following special case of Theorem 4.1, which obtains by taking $\mathcal{A}_n = \{x^{(n)}\}$ and logarithms:

Corollary 5.1 (the concentration phenomenon, coding-theoretic formulation) Assume we are given a constraint of form (2) such that $T$ is of the lattice form and $h = (h_1, \ldots, h_k)$ is the span of $T$ and such that Conditions 1 and 2 hold. For all $n$, all $x^{(n)}$ with $n^{-1}\sum_{i=1}^n T(x_i) = \tilde{t}$, we have

$$\begin{aligned} -\log \tilde{p}(x^{(n)}) &= -\log q\left(x^{(n)} \,\middle|\, \tfrac{1}{n}\textstyle\sum_{i=1}^n T(X_i) = \tilde{t}\right) + \frac{k}{2}\log 2\pi n + \log\sqrt{\det\Sigma} - \sum_{j=1}^k \log h_j + o(1) \\ &= -\log q\left(x^{(n)} \,\middle|\, \tfrac{1}{n}\textstyle\sum_{i=1}^n T(X_i) = \tilde{t}\right) + \frac{k}{2}\log n + O(1). \end{aligned} \tag{14}$$

In words, this means the following: let $x^{(n)}$ be a sample distributed according to $Q$. Suppose we are given the information that $n^{-1}\sum_{i=1}^n T(x_i) = \tilde{t}$. Then, by the information inequality, the most efficient code to encode $x^{(n)}$ is the one based on $q(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ with lengths $-\log q(x^{(n)} \mid \overline{T}^{(n)} = \tilde{t})$. Yet if we encode $x^{(n)}$ using the code with lengths $-\log \tilde{p}(\cdot)$ (which would be the most efficient had $x^{(n)}$ been generated by $\tilde{p}$) then the number of extra bits we need is only of the order $(k/2)\log n$. That means, for example, that the number of additional bits we need per outcome goes to 0 as $n$ increases. Grünwald [2001a] used Corollary 5.1 to establish a formal connection between the concentration phenomenon and universal coding, a central concept of information theory; this is worked out in more detail by Grünwald [2007], Chapter 10, Section 2.2. In the present paper, we focus on the game-theoretic consequences of Corollary 5.1.
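The size of the coding overhead in Corollary 5.1 can be computed exactly for the dice example. By item 2 of Theorem 4.1 applied to a single sequence, the number of extra bits equals $-\log \tilde{P}(\overline{T}^{(n)} = \tilde{t})$, and this probability is obtained by repeated convolution. The sketch below is illustrative: the function names are hypothetical, and it assumes $k = 1$, span $h = 1$, and the rounded MaxEnt probabilities from (13). It compares the exact overhead with the leading terms $\frac{1}{2}\log 2\pi n + \log\sqrt{\det\Sigma}$ of (14).

```python
import math

# MaxEnt probabilities from (13); faces are 1..6.
P_TILDE = [0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749]


def prob_sum_equals(p, n, total):
    """P(X_1 + ... + X_n = total) for i.i.d. faces 1..len(p) with mass function p."""
    dist = [1.0]  # distribution of the empty sum
    for _ in range(n):
        nxt = [0.0] * (len(dist) + len(p))
        for s, mass in enumerate(dist):
            if mass:
                for face, pf in enumerate(p, start=1):
                    nxt[s + face] += mass * pf
        dist = nxt
    return dist[total]


def extra_bits(n):
    """Exact overhead of coding with p~ instead of the conditioned prior (n even)."""
    return -math.log2(prob_sum_equals(P_TILDE, n, 9 * n // 2))


def asymptotic_bits(n):
    """Leading terms of (14) for k = 1 and h = 1: (1/2) log2(2 pi n) + log2(sigma)."""
    mean = sum(j * pj for j, pj in enumerate(P_TILDE, start=1))
    var = sum(j * j * pj for j, pj in enumerate(P_TILDE, start=1)) - mean ** 2
    return 0.5 * math.log2(2 * math.pi * n) + 0.5 * math.log2(var)


if __name__ == "__main__":
    for n in (20, 100, 200):
        exact = extra_bits(n)
        print(n, round(exact, 4), round(asymptotic_bits(n), 4), round(exact / n, 5))
```

The last printed column is the overhead per outcome, which decreases towards 0 even though the total overhead keeps growing like $\frac{1}{2}\log n$.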



5.2 Empirical Constraints and Game Theory 



Recall we assume countable $\mathcal{X}$. The $\sigma$-algebra of such $\mathcal{X}$ is tacitly taken to be the power set of $\mathcal{X}$. The $\sigma$-algebra thus being implicitly understood, we can define $\mathcal{P}(\mathcal{X})$ to be the set of all probability distributions over $\mathcal{X}$. For a product $\mathcal{X}^\infty = \times_{i \in \mathbb{N}} \mathcal{X}$ of a countable sample space $\mathcal{X}$, we define $\mathcal{P}(\mathcal{X}^\infty)$ to be the set of all distributions over the product space with the associated product $\sigma$-algebra.

Topsøe [1979] and Grünwald and Dawid [2004] provided characterizations of Maximum Entropy distributions quite different from the present one. It was shown that, under regularity conditions,

$$H_q(\tilde{p}) = \sup_{p^* : E_{p^*}[T] = \tilde{t}}\; \inf_p\; E_{p^*}\left[-\log\frac{p(X)}{q(X)}\right] = \inf_p\; \sup_{p^* : E_{p^*}[T] = \tilde{t}}\; E_{p^*}\left[-\log\frac{p(X)}{q(X)}\right] \tag{15}$$

where both $p$ and $p^*$ are understood to be members of $\mathcal{P}(\mathcal{X})$ and $H_q(\tilde{p})$ is defined as in (1). By this result, the MaxEnt setting can be thought of as a game between Nature, who can choose any $p^*$ satisfying the constraint, and Statistician, who only knows that Nature will choose a $p^*$ satisfying the constraint. Statistician wants to minimize his worst-case expected codelength (relative to $q$), where the worst-case is over all choices for Nature. In such game-theoretic contexts, the codelength is usually called "logarithmic score" or "logarithmic loss" [Grünwald and Dawid, 2004].

It turns out that the minimax strategy for Statistician in (15) is given by $\tilde{p}$. That is,

$$\tilde{p} = \arg\inf_p\; \sup_{p^* : E_{p^*}[T] = \tilde{t}}\; E_{p^*}\left[-\log\frac{p(X)}{q(X)}\right] \tag{16}$$
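One way to see why the MaxEnt distribution is minimax is that it equalizes Statistician's loss: because $-\log(\tilde{p}(x)/q(x))$ is an affine function of $T(x)$, its expectation is the same under every distribution that satisfies the constraint. The sketch below spot-checks this on the dice example for three hand-picked choices of Nature. It is an illustration, not a proof; the names and the particular choices for Nature are assumptions, and the tolerance reflects the five-digit rounding of the probabilities in (13).

```python
import math

P_TILDE = [0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749]  # from (13)
Q = [1 / 6] * 6                                                     # uniform prior

# Three possible choices for Nature, each with E[X] = 4.5.
NATURE = [
    P_TILDE,
    [0, 0, 0, 0.5, 0.5, 0],
    [0.3, 0, 0, 0, 0, 0.7],
]


def expected_log_loss(p_star, p, q):
    """E_{p*}[-log2(p(X)/q(X))]: expected codelength of p relative to q, in bits."""
    return sum(ps * -math.log2(px / qx)
               for ps, px, qx in zip(p_star, p, q) if ps > 0)


def q_entropy(p, q):
    """H_q(p) = -D(p || q), in bits."""
    return -sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)


def worst_case(p, q, choices):
    return max(expected_log_loss(p_star, p, q) for p_star in choices)


if __name__ == "__main__":
    print("H_q(p~):", round(q_entropy(P_TILDE, Q), 4))
    for p_star in NATURE:
        print("loss of p~ against", p_star, "->",
              round(expected_log_loss(p_star, P_TILDE, Q), 4))
    print("worst case when predicting with q itself:", worst_case(Q, Q, NATURE))
```

Against each of the three choices the loss of $\tilde{p}$ agrees with $H_q(\tilde{p})$ up to rounding, whereas predicting with the prior $q$ itself yields relative loss 0, which is larger because $H_q(\tilde{p}) = -D(\tilde{p}\|q)$ is negative here.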



Thus, $\tilde{p}$ is both the optimal strategy for Nature and for Statistician. This gives a decision-theoretic justification of using MaxEnt probabilities which seems quite different from our concentration phenomenon. Or is it? Realizing that in practical situations we deal with empirical constraints of form (4) rather than (2) we may wonder what distribution $\hat{p}$ is minimax in the empirical version of problem (16). In this version Nature gets to choose an individual sequence rather than a distribution. To our knowledge, we are the first to analyze this 'empirical' game. To make it more precise, let

$$\mathcal{C}_n = \left\{ x^{(n)} \in \mathcal{X}^n \;\middle|\; n^{-1}\sum_{i=1}^n T(x_i) = \tilde{t} \right\}. \tag{17}$$

Then, for $n$ with $\mathcal{C}_n \ne \emptyset$, $\hat{p}_n$ (if it exists) is defined by

$$\hat{p}_n := \arg\inf_{p \in \mathcal{P}(\mathcal{X}^n)}\; \sup_{x^{(n)} \in \mathcal{C}_n} -\log\frac{p(x_1, \ldots, x_n)}{q(x_1, \ldots, x_n)} = \arg\sup_{p \in \mathcal{P}(\mathcal{X}^n)}\; \inf_{x^{(n)} \in \mathcal{C}_n} \frac{p(x^{(n)})}{q(x^{(n)})} \tag{18}$$

$\hat{p}_n$ can be interpreted in two ways: (1) it is the distribution that assigns 'maximum probability' (relative to $q$) to all sequences satisfying the constraint; (2) as $-\log(\hat{p}(x^{(n)})/q(x^{(n)})) = \sum_{i=1}^n (-\log \hat{p}(x_i \mid x_1, \ldots, x_{i-1}) + \log q(x_i \mid x_1, \ldots, x_{i-1}))$, it is also the $p$ that minimizes cumulative worst-case logarithmic loss relative to $q$ when used for sequentially predicting $x_1, \ldots, x_n$.

One immediately verifies that $\hat{p}_n = q^n(\cdot \mid \overline{T}^{(n)} = \tilde{t})$: the solution to the empirical minimax problem is just the conditioned prior, which we know by Theorems 4.1 and 4.3 is in some sense very close to $\tilde{p}$. However, for no single $n$, is $\tilde{p}$ exactly equal to $q^n(\cdot \mid \overline{T}^{(n)} = \tilde{t})$. Indeed, $q^n(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ assigns zero probability to any sequence of length $n$ not satisfying the constraint. This means that using $q^n(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ in prediction tasks against the logarithmic loss will be problematic if the constraint only holds approximately and/or if $n$ is unknown in advance to the Statistician. In the latter case, it is impossible to use $q(\cdot \mid \overline{T}^{(n)} = \tilde{t})$ for prediction without modification. For suppose that the Statistician guesses that the sample will have length $n_1$ for some $n_1$ with $\mathcal{C}_{n_1} \ne \emptyset$. There exist sequences $x^{(n_2)} = x_1, \ldots, x_{n_1}, \ldots, x_{n_2}$ of length $n_2 > n_1$ satisfying the constraint such that $x^{(n_1)}$ does not satisfy the constraint, and therefore $q(x^{(n_2)} \mid x^{(n_1)} \in \mathcal{C}_{n_1}) = 0$, so $q(\cdot \mid x^{(n_1)} \in \mathcal{C}_{n_1})$ cannot be used for prediction if the actual sequence length turns out to exceed $n_1$. We may guess that in this case ($n$ not known in advance), the MaxEnt distribution $\tilde{p}$, rather than $q(\cdot \mid \overline{T}^{(n)} = \tilde{t})$, is actually a better distribution to use for prediction. The following theorem shows that in some sense, this is indeed so:

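Theorem 5.2 Let $\mathcal{X}$ be a countable sample space. Assume we are given a constraint of form (2) such that $T$ is of the lattice form, and such that Conditions 1 and 2 are satisfied. Let $\mathcal{C}_n$ be as in (17). Then the infimum in

$$\inf_{p \in \mathcal{P}(\mathcal{X}^\infty)}\; \sup_{\{n \,:\, \mathcal{C}_n \ne \emptyset\}}\; \sup_{x^{(n)} \in \mathcal{C}_n} -\frac{1}{n}\log\frac{p(x^{(n)})}{q(x^{(n)})} \tag{19}$$

is achieved by the Maximum Entropy distribution $\tilde{p}$, and is equal to $H_q(\tilde{p})$.

Proof Let $\mathcal{C} = \cup_{i=1}^\infty \mathcal{C}_i$. We need to show that for all $n$, for all $x^{(n)} \in \mathcal{C}$,

$$H_q(\tilde{p}) = -\frac{1}{n}\log\frac{\tilde{p}(x^{(n)})}{q(x^{(n)})} = \inf_{p \in \mathcal{P}(\mathcal{X}^\infty)}\; \sup_{\{n \,:\, \mathcal{C}_n \ne \emptyset\}}\; \sup_{x^{(n)} \in \mathcal{C}_n} -\frac{1}{n}\log\frac{p(x^{(n)})}{q(x^{(n)})} \tag{20}$$

Equation (20) implies that $\tilde{p}$ reaches the inf in (19) and that the inf is equal to $H_q(\tilde{p})$. The leftmost equality in (20) is a standard result about exponential families of form (9); see for example, [Grünwald, 2007]. To prove the rightmost equality in (20), let $x^{(n)} \in \mathcal{C}_n$. Consider the conditional distribution $q(\cdot \mid x^{(n)} \in \mathcal{C}_n)$. Note that, for every distribution $p_0$ over $\mathcal{X}^n$, $p_0(x^{(n)}) \le q(x^{(n)} \mid x^{(n)} \in \mathcal{C}_n)$ for at least one $x^{(n)} \in \mathcal{C}_n$. By Theorem 4.1 (or rather Corollary 5.1), for this $x^{(n)}$ we have

$$-\frac{1}{n}\log\frac{p_0(x^{(n)})}{q(x^{(n)})} \ge -\frac{1}{n}\log\frac{\tilde{p}(x^{(n)})}{q(x^{(n)})} - \frac{k}{2n}\log n - O\left(\frac{1}{n}\right),$$

and we see that for every distribution $p_0$ over $\mathcal{X}^\infty$,

$$\sup_{\{n \,:\, \mathcal{C}_n \ne \emptyset\}}\; \sup_{x^{(n)} \in \mathcal{C}_n} -\frac{1}{n}\log\frac{p_0(x^{(n)})}{q(x^{(n)})} \ge \sup_{\{n \,:\, \mathcal{C}_n \ne \emptyset\}}\; \sup_{x^{(n)} \in \mathcal{C}_n} -\frac{1}{n}\log\frac{\tilde{p}(x^{(n)})}{q(x^{(n)})},$$

which shows the rightmost equality in (20). □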

Theorem 5.2 shows that, among all distributions on $\mathcal{X}^\infty$, the minimax codelength per outcome for sequences satisfying the constraints is achieved by the maximum entropy $\tilde{p}$. We may now ask whether it is also achieved by any different distribution $p'$, and if so, whether that distribution may even be "better" in the sense that it achieves strictly smaller codelengths on all sequences of all lengths that satisfy the constraints. Surprisingly, the answer depends on the number of constraints $k$: for $k > 2$, there exists such a $p'$. For $k \le 2$, there does not:

Theorem 5.3 Assume we are given a constraint such that Conditions 1 and 2 both hold, $\mathcal{X}$ is finite, $T$ is of the lattice form, and $T = (T_{[1]}, \ldots, T_{[k]})$ for some $k > 2$. Then (a), there exists a distribution $p'$ and a constant $c' > 0$, such that, for all large enough $n$ with $\mathcal{C}_n \ne \emptyset$, for all $x^{(n)} \in \mathcal{C}_n$,

$$\log\frac{p'(x_1, \ldots, x_n)}{\tilde{p}(x_1, \ldots, x_n)} > c' \log n.$$

Moreover, (b), there exists a distribution $p''$ and a constant $c'' > 0$ such that for all $n$ (and not just all large $n$) with $\mathcal{C}_n \ne \emptyset$, for all $x^{(n)} \in \mathcal{C}_n$,

$$\log\frac{p''(x_1, \ldots, x_n)}{\tilde{p}(x_1, \ldots, x_n)} > c''.$$

Theorem 5.4 Assume we are given a constraint such that Conditions 1 and 2 both hold, $\mathcal{X}$ is finite, $T$ is of the lattice form, and $T = (T_{[1]}, \ldots, T_{[k]})$ for some $k \le 2$. Then there exists no distribution $p'$ such that, for all large $n$ with $\mathcal{C}_n \ne \emptyset$, for all $x^{(n)} \in \mathcal{C}_n$,

$$\log\frac{p'(x_1, \ldots, x_n)}{\tilde{p}(x_1, \ldots, x_n)} > 0.$$

The upshot of these theorems is that, if it is known that the sample satisfies the constraint, but the sample size is not known, then, if $k \ge 3$, there exist distributions which are guaranteed to compress the data more than $\tilde{p}$, so that the game-theoretic justification for predicting/coding with $\tilde{p}$ is, to some extent, challenged. The proofs of both theorems make use of the following lemma, which we state and prove first:

Lemma 5.5 Under the conditions of Theorems 5.3 and 5.4, suppose there exists a distribution $p'$ such that, for all large enough $n$ with $\mathcal{C}_n \ne \emptyset$, for all $x^{(n)} \in \mathcal{C}_n$,

$$\log\frac{p'(x_1, \ldots, x_n)}{\tilde{p}(x_1, \ldots, x_n)} > 0. \tag{21}$$

Then there also exists a distribution $p''$ and a fixed $c'' > 0$ such that for all $n$ (and not just for all large $n$) with $\mathcal{C}_n \ne \emptyset$, for all $x^{(n)} \in \mathcal{C}_n$, $\log\frac{p''(x_1, \ldots, x_n)}{\tilde{p}(x_1, \ldots, x_n)} > c''$.

Proof Let $n^*$ be the smallest $n$ such that for all $x^{(n)} \in C_n$, 
condition (21) holds. Let $n_1$ be the smallest $n$ such that $C_n$ is 
nonempty. Note that $n^* \geq n_1$. Let $a = \lceil n^*/n_1 \rceil$, i.e. 
$n^*/n_1$ rounded up to the nearest integer. Then $n_1 \cdot a \geq n^*$, so 
that, with $n_2 := n_1(a - 1)$, we have $0 \leq n_2 < n^*$. We only consider 
the case $0 < n_2$; the case $n_2 = 0$ is completely analogous, but easier. So 
assume $n_2 > 0$. Then $C_{n_2}$ is nonempty (to see this, take any 
$x^{(n_1)} \in C_{n_1}$ and note that the sequence consisting of $(a - 1)$ 
repetitions of $x^{(n_1)}$ must satisfy the constraint and therefore be in 
$C_{n_2}$). We may assume that 

$$\inf_{x^{(n_2)} \in C_{n_2}} \frac{p'(x^{(n_2)})}{p(x^{(n_2)})} \leq 1, \quad (22)$$

(otherwise it would follow that $n_2$ rather than $n^*$ is the smallest $n$ 
for which condition (21) holds, and we would have a contradiction). Now, let 
$y^{(n_2)} \in C_{n_2}$ be any sequence that achieves the infimum in (22). 
Since by our condition, for all $x^{(a n_1)} \in C_{a n_1}$, 
$p'(x^{(a n_1)})/p(x^{(a n_1)}) > 1$, and also 
$p'(x^{(a n_1)}) = p'(x_{n_2+1}, \ldots, x_{a n_1} \mid x^{(n_2)}) \cdot p'(x^{(n_2)})$, 
and also $C_{n_1}$ is finite, it follows that 

$$\min_{z^{(n_1)} \in C_{n_1}} \frac{P'(X_{n_2+1} = z_1, \ldots, X_{a n_1} = z_{n_1} \mid X^{(n_2)} = y^{(n_2)})}{p(z^{(n_1)})} > 1. \quad (23)$$



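We may now define a probability mass function $p^\circ$ on 
$\mathcal{X}^{n_1}$ such that, for $z^{(n_1)} \in C_{n_1}$, 
$p^\circ(z^{(n_1)})$ is slightly smaller than 
$P'(X_{n_2+1} = z_1, \ldots, X_{a n_1} = z_{n_1} \mid X^{(n_2)} = y^{(n_2)})$, 
whereas for $z^{(n_1)} \in \mathcal{X}^{n_1} \setminus C_{n_1}$, 
$p^\circ(z^{(n_1)})$ is slightly larger than 
$P'(X_{n_2+1} = z_1, \ldots, X_{a n_1} = z_{n_1} \mid X^{(n_2)} = y^{(n_2)})$, 
where we increase and decrease the probability in such a way that the total 
probability on $\mathcal{X}^{n_1}$ remains 1. Since $\mathcal{X}^{n_1}$ is 
finite, using (23), we can do this in such a way that for some 
$\epsilon > 0$, for all $z^{(n_1)} \in C_{n_1}$, 

$$\frac{p^\circ(z^{(n_1)})}{p(z^{(n_1)})} > 1 + \epsilon \quad (24)$$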

whereas for all $z^{(n_1)} \in \mathcal{X}^{n_1} \setminus C_{n_1}$, 

$$\frac{p^\circ(z^{(n_1)})}{P'(X_{n_2+1} = z_1, \ldots, X_{a n_1} = z_{n_1} \mid X^{(n_2)} = y^{(n_2)})} > 1. \quad (25)$$
We now extend p° to a probability mass function on X"^ by defining, for all z^"^^^ E 
for any m > 1, G A^"^, . . . , yi, . . . , y^) := p°(^("i))P'(^m+i = 

= I = ^("i)). Now, for any n > 1, x^") G A"", let m 

be the number of distinct initial segments $x_1, \ldots, x_{n'}$ of $x^{(n)}$ 
with $n' < n$ that satisfy the constraint, i.e. $x^{(n')} \in C_{n'}$. Notice 
that we may have $m = 0$. We set $s_0 = 0$, $s_{m+1} = n$, and, for 
$j \in \{1, \ldots, m\}$, $s_j$ is set such that $x^{(s_j)} \in C_{s_j}$ and 
$s_1 < s_2 < \ldots < s_m < n$. We define 

m 

One easily verifies by induction on $n$ that $p''$ defines a probability mass 
function and, using (22) and (25), that for all $n$ with 
$C_n \neq \emptyset$, all $x^n \in C_n$, $p''(x^n)/p(x^n) > 1 + \epsilon$. 
The result follows. □ 
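
The index bookkeeping at the start of this proof can be checked mechanically. The following is a minimal Python sketch that computes a and n2 from n* and n1 and asserts the inequalities 0 ≤ n2 < n* ≤ a·n1 used above. The function name `lemma_indices` and the sample values are illustrative assumptions, not part of the original argument.

```python
def lemma_indices(n_star: int, n1: int):
    """Return (a, n2) with a = ceil(n_star / n1) and n2 = n1 * (a - 1).

    Hypothetical helper: n_star plays the role of n* and n1 that of the
    smallest sample size at which the constraint can hold.
    """
    if not (1 <= n1 <= n_star):
        raise ValueError("need 1 <= n1 <= n_star")
    a = -(-n_star // n1)  # integer ceiling of n_star / n1
    n2 = n1 * (a - 1)
    return a, n2


# Illustrative values only.
for n_star, n1 in [(7, 3), (6, 3), (5, 5), (10, 1)]:
    a, n2 = lemma_indices(n_star, n1)
    # The inequalities relied on in the proof.
    assert 0 <= n2 < n_star <= a * n1
    print(f"n*={n_star}, n1={n1}: a={a}, n2={n2}")
```

The pair (5, 5) gives n2 = 0, which is the case the proof calls analogous but easier.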

Proof (of Theorem 5.3) Let $n_1 < n_2 < \ldots$ be the sequence of all $n$ 
such that $C_n \neq \emptyset$. Define, for $j = 1, 2, \ldots$, 
$q_j := q(\cdot \mid T^{(n_j)} = t)$. Let $\pi$ be any distribution on the 
natural numbers such that, for all $j \in \{1, 2, \ldots\}$, 

$$-\log \pi(j) = \log j + O(\log \log j).$$

For example, we may take Rissanen's universal prior for the integers 
[Rissanen, 1989], $\pi(j) \propto 1/(j (\log j)^2)$. Now set, for all $n$, 
$x^n \in \mathcal{X}^n$, 

$$p'(x^n) := \sum_{j=1,2,\ldots} \pi(j)\, q_j(x^n).$$

Then $p'$ uniquely induces a distribution $P'$ on $\mathcal{X}^\infty$ that 
is actually a mixture of all conditional distributions given that the 
constraint holds at some sample size for which it can hold at all. For each 
$j \in \{1, 2, \ldots\}$, each $x^{(n_j)} \in C_{n_j}$, we have, writing $n$ 
for $n_j$, 

$$-\log p'(x^{(n)}) = -\log \sum_{j'} \pi(j')\, q_{j'}(x^{(n)})$$
$$\leq -\log \pi(j) - \log q(x^{(n)} \mid T^{(n)} = t)$$
$$\leq \log j + O(\log \log j) - \frac{k}{2} \log n - \log p(x^{(n)}),$$

where the final inequality follows by Corollary 5.1. Since $j \leq n$ and 
$k > 2$, Part (a) of the theorem follows. Part (b) is now an immediate 
consequence of Lemma 5.5. □ 
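
The first inequality in the chain above is the standard mixture-code bound: a mixture never assigns a sequence a codelength more than −log π(j) longer than any single component j does. The sketch below checks this on made-up numbers and evaluates the leading term (k/2 − 1) log n of the resulting gain. The weights, the component probabilities, and all helper names are hypothetical. Logarithms are taken to base 2, and the O(log log n) remainder is ignored.

```python
import math


def mixture_codelength(weights, probs):
    """-log2 of sum_j pi(j) * q_j(x) for one fixed sequence x."""
    return -math.log2(sum(w * q for w, q in zip(weights, probs)))


def component_codelength(weights, probs, j):
    """-log2 pi(j) - log2 q_j(x): name component j, then code x with it."""
    return -math.log2(weights[j]) - math.log2(probs[j])


def leading_gain_bits(k, n):
    """Leading term (k/2 - 1) * log2(n) of the gain, remainder ignored."""
    return (k / 2 - 1) * math.log2(n)


# Made-up prior weights pi(j) and component probabilities q_j(x).
weights = [0.5, 0.3, 0.2]
probs = [1e-6, 2e-4, 5e-5]

for j in range(len(weights)):
    # The mixture is never worse than any single component plus its index cost.
    assert mixture_codelength(weights, probs) <= component_codelength(weights, probs, j)

print(leading_gain_bits(3, 1024))  # k = 3 constraints, n = 1024
print(leading_gain_bits(2, 1024))  # k = 2: the leading term vanishes
```

The leading term is positive exactly when k > 2, which matches the split between Theorem 5.3 and Theorem 5.4.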

Proof (of Theorem 5.4) Assume, by means of contradiction, that a $p'$ as 
mentioned in the theorem does exist. Then by Lemma 5.5 there also exists a 
$p''$ such that for all $n$ with $C_n \neq \emptyset$, for all 
$x^{(n)} \in C_n$, 
$\log \frac{p''(x_1, \ldots, x_n)}{p(x_1, \ldots, x_n)} \geq c''$ for some 
$c'' > 0$. For $0 < \alpha < 1$, define a probability distribution on 
$\mathcal{X}^\infty$ in terms of its mass function $p_\alpha$, by 
$p_\alpha(x^n) := \alpha p''(x^n) + (1 - \alpha) p(x^n)$. Note that 

$$p_\alpha(x^n) \geq \max\{\alpha p''(x^n), (1 - \alpha) p(x^n)\}. \quad (26)$$
Now, for any $n \geq 1$, $x^{(n)} \in \mathcal{X}^n$, let $m$ be the number of 
distinct initial segments $x_1, \ldots, x_{n'}$ of $x^{(n)}$ with $n' < n$ 
that satisfy the constraint, i.e. $x^{(n')} \in C_{n'}$. Notice that we may 
have $m = 0$. We set $s_0 = 0$, $s_{m+1} = n$, and, for 
$j \in \{1, \ldots, m\}$, $s_j$ is set such that $x^{(s_j)} \in C_{s_j}$ and 
$s_1 < s_2 < \ldots < s_m < n$. We define 

$$p^\circ(x^{(n)}) := \prod_{j=0}^{m} p_\alpha(x_{s_j+1}, \ldots, x_{s_{j+1}}).$$
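
To make the segmentation concrete, here is a minimal Python sketch that computes the cut points s_0, …, s_{m+1} for a given sequence and multiplies a block mass function over the resulting segments. It assumes a single binary constraint (outcomes in {0, 1}, T(x) = x, t = 1/2) and, in the usage lines, substitutes the i.i.d. uniform mass function for p_α. These choices and the function names are illustrative assumptions only.

```python
from fractions import Fraction


def cut_points(xs, T, t):
    """Return [s_0, ..., s_{m+1}] for the sequence xs.

    s_0 = 0, s_{m+1} = len(xs), and s_1 < ... < s_m are the lengths n' < n
    of the proper initial segments whose empirical average of T equals t.
    """
    n = len(xs)
    cuts = [0]
    total = Fraction(0)
    for i, x in enumerate(xs, start=1):
        total += T(x)
        if i < n and total == i * t:  # exact test: prefix satisfies constraint
            cuts.append(i)
    cuts.append(n)
    return cuts


def product_mass(xs, T, t, block_mass):
    """Product of block_mass over the segments x_{s_j + 1}, ..., x_{s_{j+1}}."""
    cuts = cut_points(xs, T, t)
    prob = 1.0
    for lo, hi in zip(cuts, cuts[1:]):
        prob *= block_mass(xs[lo:hi])
    return prob


xs = [0, 1, 1, 0, 0, 1, 1]
half = Fraction(1, 2)
print(cut_points(xs, lambda x: x, half))
# Stand-in for p_alpha: uniform i.i.d. mass on binary blocks (an assumption).
print(product_mass(xs, lambda x: x, half, lambda block: 0.5 ** len(block)))
```

With a product-form block mass such as the uniform stand-in, the segmentation has no effect on the result; it matters only because p_α is a mixture and therefore does not factorize over blocks.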



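One easily verifies by induction on $n$ that (a) $p^\circ$ is the mass 
function of some probability distribution $P^\circ$ on $\mathcal{X}^\infty$, 
and, using (26), that (b) for all $n$, all $x^n \in \mathcal{X}^n$, 

$$p^\circ(x^n) \geq \left( \prod_{j=0}^{m-1} \alpha\, p''(x_{s_j+1}, \ldots, x_{s_{j+1}}) \right) (1 - \alpha)\, p(x_{s_m+1}, \ldots, x_n)$$
$$\geq \alpha^m 2^{m c''} p(x_1, \ldots, x_{s_m}) (1 - \alpha)\, p(x_{s_m+1}, \ldots, x_n)$$
$$= \alpha^m (1 - \alpha) 2^{m c''} p(x^{(n)}), \quad (27)$$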

where $c''$ is as in Lemma 5.5. Now suppose that $X_1, X_2, \ldots$ are 
i.i.d. $\sim P$, i.e. data are sampled from the MaxEnt distribution $P$. We 
may view $U_n := \sum_{i=1}^{n} (T(X_i) - t)$ as specifying a Markov chain, 
where the state at time $n$ is given by the value of $U_n$, the transition 
probabilities are given by $P(U_{n+1} = \cdot \mid U_n = u)$ for each 
realizable value of $u$, and the starting state is $U_0 := 0$. By the local 
central limit theorem, the probability of being in state "0" at time $n$ is 
of order $1/\sqrt{n}$ (if $k = 1$) or $1/n$ (if $k = 2$). In both cases, the 
sum of these probabilities over $n$ diverges, so it follows by basic Markov 
chain theory [Feller, 1968] that state "0" is recurrent and, with probability 
1, $U_n = 0$ will hold for infinitely many $n$. But, for $n > 0$, $U_n = 0$ 
is equivalent to $x^{(n)} \in C_n$, i.e. the constraint holds. It follows 
that the constraint will hold infinitely often, almost surely under $P$, so 
that the number $m$ of segments in (27) tends to infinity. Hence, if we 
encode the sequence $x_1, \ldots, x_n$ with the code corresponding to 
$p^\circ$ rather than $p$, and select a value of $\alpha < 1$ such that 
$\alpha 2^{c''} > 1$, then by (27) we will $P$-almost surely compress the 
data significantly better than if we use the code with lengths $-\log p$ 
itself. More precisely, with $P$-probability 1, 

$$-\log \frac{p(X^n)}{p^\circ(X^n)} \to \infty.$$

But this contradicts the no-hypercompression inequality [Grünwald, 2007] 
(also known as the "competitive optimality of the Shannon-Fano code" 
[Cover and Thomas, 1991]), an easy consequence of Markov's inequality which 
states that for all $K > 0$, and any two distributions $P$ and $Q$ with mass 
functions $p$ and $q$, for all $n$, 
$P(-\log p(X^n) \geq -\log q(X^n) + K) \leq 2^{-K}$. The theorem is proved. □ 
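
Two ingredients of this proof lend themselves to a quick numerical check. The Python sketch below is an illustration under stated assumptions, not the construction used in the paper. For the recurrence step it takes the simplest k = 1 instance, a symmetric ±1 walk started at 0, whose return probability at even n is known in closed form. For the no-hypercompression inequality it takes P and Q to be i.i.d. Bernoulli distributions with hypothetical parameters and computes the probability exactly by summing over the number of ones.

```python
import math


def return_prob(n):
    """P(U_n = 0) for a symmetric +/-1 random walk with U_0 = 0."""
    if n % 2:
        return 0.0
    return math.comb(n, n // 2) / 2 ** n


def return_prob_partial_sum(n_max):
    """Sum of P(U_n = 0) over n = 1, ..., n_max."""
    return sum(return_prob(n) for n in range(1, n_max + 1))


def hypercompression_prob(n, theta_p, theta_q, K):
    """Exact P(-log2 p(X^n) >= -log2 q(X^n) + K) for i.i.d. Bernoulli P and Q."""
    total = 0.0
    for ones in range(n + 1):
        p_x = theta_p ** ones * (1 - theta_p) ** (n - ones)
        q_x = theta_q ** ones * (1 - theta_q) ** (n - ones)
        if q_x >= 2.0 ** K * p_x:  # q compresses x at least K bits better
            total += math.comb(n, ones) * p_x
    return total


# Return probability decays like 1 / sqrt(pi * n / 2); the ratio approaches 1.
print(return_prob(10000) * math.sqrt(math.pi * 5000))
# Partial sums keep growing, in line with a divergent series.
print(return_prob_partial_sum(100), return_prob_partial_sum(10000))
# The hypercompression probability stays below 2 ** -K.
for K in (1, 2, 5):
    print(K, hypercompression_prob(50, 0.5, 0.7, K), 2.0 ** -K)
```

For k = 2 the return probabilities are of order 1/n, so their sum behaves like the harmonic series and diverges as well.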